Prediction in Baseball

Jim Albert, Emeritus Professor, BGSU

2025-06-01

Introduction

Using Prediction to Make Sense of Baseball Patterns

  • Talk about two issues in baseball: streakiness and the growth in home run hitting

  • Propose some models for baseball outcomes and home run hitting

  • Simulate predictions based on fitted models

  • Compare predictions with observed

Streakiness in Baseball

  • Media and fans are fascinated with streaky patterns of hitting

  • Does Streaky Hitting Ability exist?

  • Or maybe we see patterns similar to the streaky patterns in coin flipping with different Head probabilities.

Home Run Hitting

  • What is the reason for the sudden increase in home run hitting in recent seasons?

  • Are the hitters changing their approach?

  • Does the ball construction have an effect?

  • Other factors?

Exploring Streaky Patterns

Looking for True Streakiness

  • Collect individual plate appearance data for all players in an MLB season
  • Focus on patterns of hot and cold hitting
  • Is there evidence that players are truly streaky?
  • Maybe we are observing coin-flipping behavior?

Binary Sequence

  • Observe sequence of hitting data for a player
  • For each plate appearance, observe a “success” (1) or “failure” (0)
  • Focus on pattern of streaks and slumps
  • Look at spacings, the number of “failures” between consecutive “successes”

Different Definitions of Success

  • “success” = HIT
  • “success” = HOME RUN
  • “success” = STRIKEOUT

Example - Mike Schmidt Home Run Spacings

  • In 1980 season, Schmidt hit 48 home runs on these PAs:

25 32 41 45 72 76 86 87 100 131 141 150 160 162 176 178 182 187 221 228 269 301 316 339 342 343 368 406 414 420 425 433 454 455 473 522 540 554 578 588 596 598 604 616 637 640 645 652

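  • Spacings, counted as plate appearances from one home run to the next, are 25, 7, 9, 4, 27, … (each is one more than the number of failures in between)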
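The spacings can be recovered from the list of home run plate appearances by differencing. The following is a minimal sketch; the function name `spacings` and the convention of counting the first gap from PA 0 are assumptions chosen to reproduce the values quoted above.

```python
# Plate appearances (PAs) on which Schmidt homered in 1980, copied from the list above.
hr_pas = [25, 32, 41, 45, 72, 76, 86, 87, 100, 131, 141, 150, 160, 162, 176,
          178, 182, 187, 221, 228, 269, 301, 316, 339, 342, 343, 368, 406,
          414, 420, 425, 433, 454, 455, 473, 522, 540, 554, 578, 588, 596,
          598, 604, 616, 637, 640, 645, 652]


def spacings(success_pas):
    """Gaps between consecutive successes; the first gap is counted from PA 0."""
    previous = [0] + success_pas[:-1]
    return [pa - prev for pa, prev in zip(success_pas, previous)]


schmidt_spacings = spacings(hr_pas)
print(schmidt_spacings[:5])  # [25, 7, 9, 4, 27]
```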

Need a Streaky Measure

  • Consider two probability models - Consistent and Streaky
  • Construct a Bayesian measure to distinguish between the models

Geometric Model

  • Let \(y_1, ..., y_n\) denote the observed spacings.
  • Assume \(y_j\) are independent \[ y_j \sim Geometric (p_j) \]
  • Put models on the probabilities \(p_1, ..., p_n\)

Two Models - Consistent and Streaky

  • Model \(C\): Hitter is truly consistent

\[p_1 = ... = p_n = p\]

  • Model \(S\): Hitter is truly streaky

The \(p_j\) differ and follow a Beta(\(a, b\)) distribution, for specified values of \(a\) and \(b\)

Bayes Factor

  • Bayes factor in support of streaky \(S\) is \[ BF = \frac{f(y | S)}{f(y | C)} \]

  • If \(\log BF > 0\), support for true streakiness.

  • Here we call a player streaky if \(\log BF > 0.5\)
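One way the log Bayes factor might be computed is sketched below. It rests on two assumptions that the slides do not state: each spacing counts the trials up to and including the success, so that \(P(y) = p(1-p)^{y-1}\), and under Model \(C\) the common \(p\) is given the same Beta(\(a, b\)) prior that Model \(S\) places on each \(p_j\). The function names are illustrative.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_streaky(y, a, b):
    # Model S: each p_j ~ Beta(a, b) independently, so each spacing is
    # integrated over its own probability: B(a + 1, b + y_j - 1) / B(a, b).
    return sum(log_beta(a + 1, b + yj - 1) - log_beta(a, b) for yj in y)


def log_marginal_consistent(y, a, b):
    # Model C: one shared p. ASSUMPTION: p ~ Beta(a, b), the same prior as in S.
    n = len(y)
    failures = sum(yj - 1 for yj in y)
    return log_beta(a + n, b + failures) - log_beta(a, b)


def log_bayes_factor(y, a, b):
    """log BF in support of streaky (S) over consistent (C)."""
    return log_marginal_streaky(y, a, b) - log_marginal_consistent(y, a, b)
```

Under these assumptions, a player would be flagged as streaky when `log_bayes_factor(y, a, b) > 0.5` for that player's list of spacings `y`.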

Streakiness Pattern?

  • In a particular season, we’ll find some streaky hitters

  • Maybe players are truly consistent and we are observing “chance” streakiness due to multiplicity.

  • Would a consistent model for hitting predict this observed streakiness?

Prediction Exercise

  • Assume Consistent Model where each player has a single probability of success.

  • Estimate probabilities \(P_1, ..., P_N\) for \(N\) players using an exchangeable model.

  • Simulate binary outcomes from Bernoulli distributions using these probability estimates.

  • Using Bayes factor, find the fraction of streaky hitters.
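The estimation step above can be illustrated with a simple shrinkage calculation. This is one possible reading of "exchangeable model", not necessarily the one used in the talk: it assumes the players' probabilities share a common Beta(\(a, b\)) distribution, so each estimate is a posterior mean pulled toward \(a / (a + b)\). The values of `a` and `b` below are hypothetical.

```python
def shrunken_estimates(successes, attempts, a, b):
    """Posterior-mean estimates of each player's success probability.

    ASSUMPTION: all players' probabilities come from a common Beta(a, b)
    distribution, with a and b supplied by the analyst.
    """
    return [(y + a) / (n + a + b) for y, n in zip(successes, attempts)]


# Hypothetical counts for three players, each with 100 attempts.
estimates = shrunken_estimates([20, 30, 40], [100, 100, 100], a=27, b=73)
print(estimates)
```

Each estimate lands between the player's raw rate and the common mean of 0.27, with less shrinkage for players who have more attempts.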

Predictive Simulation

  • Set definition of success (HIT or SO)

  • Simulate 50 replicated datasets from the predictive distribution of the fitted consistent model

  • For each season, plot the fraction of streaky hitters

  • Compare with observed fraction
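The simulation loop might look like the sketch below. The player probabilities, the number of plate appearances, and the Beta parameters are hypothetical placeholders, and the Bayes factor uses the same assumed Beta(\(a, b\)) prior under both models, which the slides do not specify.

```python
import math
import random


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_bayes_factor(y, a, b):
    # Streaky: a separate Beta(a, b) draw per spacing.
    # Consistent: one shared p with an (assumed) Beta(a, b) prior.
    streaky = sum(log_beta(a + 1, b + yj - 1) - log_beta(a, b) for yj in y)
    consistent = log_beta(a + len(y), b + sum(yj - 1 for yj in y)) - log_beta(a, b)
    return streaky - consistent


def simulate_spacings(p, n_pa, rng):
    """Simulate n_pa Bernoulli(p) outcomes and return the gaps between successes."""
    gaps, gap = [], 0
    for _ in range(n_pa):
        gap += 1
        if rng.random() < p:
            gaps.append(gap)
            gap = 0
    return gaps


def fraction_streaky(probs, n_pa, a, b, rng, cutoff=0.5):
    """Fraction of simulated consistent players whose log BF exceeds the cutoff."""
    flags = []
    for p in probs:
        y = simulate_spacings(p, n_pa, rng)
        if y:  # skip players with no successes
            flags.append(log_bayes_factor(y, a, b) > cutoff)
    return sum(flags) / len(flags) if flags else 0.0


rng = random.Random(2025)
probs = [0.20, 0.25, 0.30] * 10  # hypothetical estimates for 30 players
fractions = [fraction_streaky(probs, 500, a=1.0, b=3.0, rng=rng) for _ in range(50)]
```

The 50 values in `fractions` play the role of the predictive distribution against which the observed fraction of streaky hitters is compared.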

HIT - Streakiness in 50 Simulations

HIT - Compare with Observed

HIT as Success

  • Observed fraction of streaky hitters is similar to what one would predict from consistent model.

  • Hard to identify truly streaky hitters using Hit as success.

SO - Streakiness in 50 Simulations

SO - Compare with Observed

SO as Success

  • Find more observed streaky players than one would predict based on simulations from a consistent model.

  • So patterns of Strikeout streakiness are “interesting”.

  • Motivates search for hitters who have streaky patterns over their careers.

Understanding Surge in Home Run Hitting

HR Totals in the Statcast Era

Season  Home Runs
2015    4909
2016    5610
2017    6105
2018    5585
2019    6776
2021    5944
2022    5215
2023    5868

Focus on In-Play Rates

  • Define the home run rate as the fraction of \(HR\) among all batted balls (\(AB - SO\)): \[ HR \, Rate = \frac{HR}{AB - SO} \]

  • Look at history of \(HR\) rates
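As a small worked example of the rate definition, the sketch below uses hypothetical counts rather than real season totals; the function name is illustrative.

```python
def in_play_hr_rate(hr, ab, so):
    """Home runs per batted ball, with batted balls taken as AB - SO."""
    batted_balls = ab - so
    if batted_balls <= 0:
        raise ValueError("AB - SO must be positive")
    return hr / batted_balls


# Hypothetical hitter: 30 HR in 600 AB with 100 strikeouts.
rate = in_play_hr_rate(hr=30, ab=600, so=100)
print(rate)  # 0.06
```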

History of In-Play Home Run Rates

What is Causing the Rise in Home Run Rates?

  • In the fall of 2017, Major League Baseball charged a committee with identifying the potential causes of the increase in the rate at which home runs were hit from 2015 to 2017.

  • Committee released two reports (May 2018 and December 2019)

Possible Reasons for Increase in HRs

The batters?

  • Changes in characteristics of batted balls
  • Launch angle, exit velocity, and spray angle

The pitchers?

  • Changes in types of pitches
  • Pitch location

The ball?

  • Changes in how the ball is made?
  • Seam height, core?
  • Drag coefficient (resistance of ball as it travels)?

Game conditions?

  • Ballpark effect
  • Weather
  • Cold vs. hot temperatures

Process of Hitting a Ball

  • IN-PLAY: Have to put the ball in play

  • HIT IT RIGHT: The batted ball needs to have the “right” launch angle and exit velocity

  • REACH THE SEATS: Given the exit velocity and launch angle, the ball needs sufficient distance and height to clear the fence (the carry of the ball)

Recent Exploration of Home Run Rates

  • Nine seasons of Statcast data (2015 - 2023) are available
  • Have launch speed and launch angle measurements for all seasons
  • Take a broader perspective on home run hitting

Empirical Approach

  • Look at the region of launch angle and exit velocity where most home runs are hit
  • Look at the rate of batted balls in this region – how does it vary by season?
  • Look at the rate of home runs for balls hit in this region – how does it vary by season?

Launch Vars Where Most HR are Hit (RED Zone)

Balls in Play Rate

  • Interested in rate of “home run likely” (RED Zone) batted balls \[ BIP \, Rate = \frac{HR \, Likely}{BIP} \]
  • Are batters changing their approach?
  • Players getting stronger?

Rate of Balls Hit in RED Zone

  • See a general increase in “home run likely” rates over Statcast period
  • Players appear to be changing their hitting approach or they are getting stronger

Home Run Rate in RED Zone

  • What is the chance of a home run given good values of launch angle and exit velocity? \[ HR \, Rate = \frac{HR} {HR \,Likely} \]
  • Characteristic of the baseball
  • Changes in drag coefficient over seasons?
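The two rates can be computed from batted-ball records as sketched below. The RED Zone boundaries used here (launch angle between 20 and 35 degrees, launch speed of at least 100 mph) are placeholder assumptions, since the slides define the zone only graphically, and the four batted balls are invented for illustration.

```python
from typing import NamedTuple


class BattedBall(NamedTuple):
    launch_angle: float  # degrees
    launch_speed: float  # mph
    is_hr: bool


def in_red_zone(ball, angle_range=(20.0, 35.0), min_speed=100.0):
    # ASSUMPTION: rectangular zone with placeholder bounds.
    low, high = angle_range
    return low <= ball.launch_angle <= high and ball.launch_speed >= min_speed


def red_zone_rates(balls):
    """Return (HR Likely / BIP, HR / HR Likely) for a list of batted balls."""
    zone = [b for b in balls if in_red_zone(b)]
    bip_rate = len(zone) / len(balls)
    hr_rate = sum(b.is_hr for b in zone) / len(zone) if zone else float("nan")
    return bip_rate, hr_rate


balls = [
    BattedBall(28.0, 104.0, True),
    BattedBall(25.0, 101.0, False),
    BattedBall(10.0, 95.0, False),
    BattedBall(45.0, 88.0, False),
]
bip_rate, hr_rate = red_zone_rates(balls)
print(bip_rate, hr_rate)  # 0.5 0.5
```

Separating the two rates in this way is what lets the hitter effect (how often balls land in the zone) be read apart from the ball effect (how often such balls leave the park).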

Home Run Rate in RED Zone

  • General increase from 2015 to 2017
  • Big dip in 2018, followed by big increase in 2019
  • General decrease from 2019 to 2023
  • These “ball effects” are large

Modeling Approach

  • Focus on the in-play home run rates in July 2021 and July 2022

  • Observe big drop in the HR rate

  • Is it due to the hitter’s approach?

  • Or is it due to the ball?

Graph

Generalized Additive Model

  • Express the logit of the home run probability as \[ \log \left(\frac{P(HR)}{1 - P(HR)}\right) = s(LA, LS) \]

  • \(s()\) is a smooth function of the launch angle (LA) and the launch speed (LS)

  • Generalization of the linear regression model \(y = X \beta + \epsilon\)
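Solving the logit equation for the probability may make the model easier to read. This is standard algebra rather than anything specific to the talk:

\[ P(HR) = \frac{1}{1 + \exp\{-s(LA, LS)\}} \]

If \(s\) were restricted to the linear form \(\beta_0 + \beta_1 LA + \beta_2 LS\), the model would reduce to ordinary logistic regression; the GAM instead lets the data determine the shape of \(s\).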

Approach

  • Fit a GAM to the in-play HR data for July 2021

  • Use the model fit to predict the HR rate for July 2022 using the 2022 launch variables

  • By simulation, get a prediction distribution
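The simulation step can be sketched as follows. The function `hr_probability` is a hypothetical stand-in for the fitted 2021 model, written as a simple logistic formula with made-up coefficients, not a GAM, and the launch variables are randomly generated placeholders. The sketch also reflects only the randomness of the outcomes, not uncertainty in the model fit.

```python
import math
import random


def hr_probability(launch_angle, launch_speed):
    # HYPOTHETICAL stand-in for the fitted 2021 surface s(LA, LS); not a real GAM.
    score = -3.0 + 0.15 * (launch_speed - 100.0) - 0.02 * (launch_angle - 28.0) ** 2
    return 1.0 / (1.0 + math.exp(-score))


def predictive_rates(launch_vars, n_sims, rng):
    """Simulate HR rates for a set of batted balls under the fitted model."""
    probs = [hr_probability(la, ls) for la, ls in launch_vars]
    rates = []
    for _ in range(n_sims):
        home_runs = sum(rng.random() < p for p in probs)
        rates.append(home_runs / len(probs))
    return rates


rng = random.Random(7)
# Placeholder (launch angle, launch speed) pairs standing in for July 2022 batted balls.
launch_2022 = [(rng.uniform(-10, 50), rng.uniform(70, 115)) for _ in range(500)]
rates = predictive_rates(launch_2022, n_sims=200, rng=rng)
```

The spread of `rates` is the prediction distribution; the observed 2022 rate would then be located relative to it.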

Graph

Observations

  • Predictions are smaller than the observed 2021 rate. This indicates a change in the hitter launch variables.

  • But the observed 2022 rate falls below the prediction distribution. This indicates that the ball was deader in 2022.

Aaron Judge

  • Slugger currently playing for Yankees
  • Broke American League HR record with 62 in 2022
  • Currently has hit 330 HR in career

Aaron Judge in 2022

  • Hit 62 home runs during a season when the ball was relatively dead

  • Raises the question: How many home runs would Judge have hit in a different season of the Statcast era?

Methodology

  • Suppose the different season is 2019.

  • Fit a “2019 ball model” that predicts the probability of a HR in 2019 given values of the launch angle and exit velocity.

  • Collect the launch variables for Judge for all balls put into play. For each BIP, predict P(HR) using 2019 ball model.

  • Sum the probabilities – predict the season HR.

Predict

  • For each of Judge’s balls in play in 2022, predict the probability of a HR from the launch variables using the 2019 ball model.

  • Sum the probabilities – predict total HR count

  • Can get a 90% prediction interval
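Summing the probabilities and forming an interval might be done as in this sketch. The probabilities are hypothetical placeholders rather than Judge's actual predictions, and the interval captures only outcome randomness (each ball in play treated as an independent Bernoulli trial), so it may be narrower than one that also accounts for uncertainty in the fitted ball model.

```python
import random


def predict_total(probs, n_sims=2000, level=0.90, seed=1):
    """Expected HR count and a simulation-based prediction interval."""
    rng = random.Random(seed)
    expected = sum(probs)
    # Each simulated season draws one Bernoulli outcome per ball in play.
    totals = sorted(sum(rng.random() < p for p in probs) for _ in range(n_sims))
    lo_idx = round((1 - level) / 2 * n_sims)
    hi_idx = round((1 + level) / 2 * n_sims) - 1
    return expected, (totals[lo_idx], totals[hi_idx])


# HYPOTHETICAL per-ball HR probabilities, not real data.
probs = [0.9] * 20 + [0.5] * 20 + [0.05] * 200
expected, (lo, hi) = predict_total(probs)
print(round(expected, 1), (lo, hi))
```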

Results

  • If Judge had been hitting a 2019 ball, the model predicts he would have hit 75 home runs

  • A 90% prediction interval would be (69, 81)

Repeat this method for other Statcast seasons

  • Use a GAM to predict P(HR) from the launch angle and exit velocity for one season

  • Use this ball model to predict HR probability using 2022 launch variables

  • Sum prediction probabilities

Results

Takeaway

  • Judge only hit 62 home runs in 2022

  • But if he had been playing in a different season where the ball was more alive (more carry), his predicted 2022 count would be in the 70s

  • So Judge’s home run achievement is understated

  • Due to this ball bias, we don’t appreciate the magnitude of Judge’s accomplishment

Concluding Comments

  • Two important factors in home run hitting are the hitters (values of launch variables) and the ball (carry or drag coefficient).

  • Batters are stronger and changing their hitting approach, leading to higher rates of “HR friendly” balls in play.

  • The composition of the ball has gone through dramatic changes during the Statcast era.

  • Currently the ball is relatively dead compared to previous seasons.

References

  • 2007 Streaky Hitting in Baseball, Journal of Quantitative Analysis of Sports, Vol 4, Issue 1.

  • 2013 Looking at Spacings to Assess Streakiness, Journal of Quantitative Analysis of Sports, Vol 9, Issue 2.

  • 2014 Streakiness in Home Run Hitting. Chance, 27(3), 4-9.

  • 2020 The Home Run Explosion, Science Meets Sport, Cambridge Scholars Publishing.

  • 2024 Balls are Traveling Farther in 2024 in Progressive Field (with Alan Nathan), Baseball Prospectus